feat: Add DuckDB plugin #633

Open
andreahlert wants to merge 6 commits into flyteorg:main from andreahlert:add-duckdb-plugin

Conversation

@andreahlert
Contributor

Summary

  • Add a DuckDB connector plugin for running SQL queries against DuckDB as Flyte tasks
  • DuckDB is an embedded analytical database (like SQLite for OLAP) that runs locally and synchronously
  • Follows the same architecture and patterns as the Snowflake plugin

Features

  • In-memory and file-based database support via DuckDBConfig(database_path=...)
  • Parameterized SQL queries with typed inputs
  • Extension installation and loading (httpfs, json, spatial, etc.)
  • Query results returned as pandas DataFrames via temporary parquet files
  • Automatic cleanup of temporary result files on delete

Design

Unlike Snowflake/BigQuery, DuckDB runs locally with no remote service, so the connector pattern is adapted:

  • Queries execute synchronously in create() (wrapped in run_in_executor for async compat)
  • get() always returns SUCCEEDED since queries complete in create()
  • delete() cleans up temporary parquet result files
  • No credentials, polling, or dashboard links needed
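The create/get/delete adaptation described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the plugin's code: `LocalQueryConnector`, `run_query_blocking`, and the placeholder file contents are all hypothetical names and stand-ins, and the real connector base class and its signatures are not shown in this thread.

```python
import asyncio
import os
import tempfile


def run_query_blocking(sql: str) -> str:
    """Hypothetical stand-in for a blocking DuckDB call.

    It writes a placeholder "result" to a temporary file and returns the path.
    """
    fd, path = tempfile.mkstemp(suffix=".parquet")
    with os.fdopen(fd, "w") as f:
        f.write(f"-- placeholder result for: {sql}\n")
    return path


class LocalQueryConnector:
    """Illustrative only; the names here are assumptions, not the plugin's API."""

    async def create(self, sql: str) -> str:
        loop = asyncio.get_running_loop()
        # Run the blocking work in the default thread pool so the event loop stays free.
        return await loop.run_in_executor(None, run_query_blocking, sql)

    async def get(self, result_path: str) -> str:
        # The query already finished inside create(), so there is nothing to poll.
        return "SUCCEEDED"

    async def delete(self, result_path: str) -> None:
        # Clean up the temporary result file if it is still around.
        if os.path.exists(result_path):
            os.remove(result_path)


async def demo() -> tuple:
    connector = LocalQueryConnector()
    path = await connector.create("SELECT 42")
    existed_after_create = os.path.exists(path)
    phase = await connector.get(path)
    await connector.delete(path)
    return existed_after_create, phase, os.path.exists(path)
```

Calling `asyncio.run(demo())` walks through one full create/get/delete cycle and reports whether the result file existed before and after cleanup.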

Test plan

  • 25 unit tests covering task creation, config serialization, SQL generation, connector create/get/delete, parameterized queries, extensions, and end-to-end flow
  • All tests pass (pytest plugins/duckdb/tests/ -v)
  • Linting passes (ruff check plugins/duckdb/)
  • Plugin structure matches Snowflake plugin layout

Add a DuckDB connector plugin following the same patterns as the
Snowflake plugin. DuckDB is an embedded analytical database that runs
queries locally and synchronously, so the connector executes queries
in create() and get() always returns SUCCEEDED.

Features:
- In-memory and file-based database support
- Parameterized SQL queries with typed inputs
- Extension installation and loading (httpfs, json, etc.)
- Query results returned as pandas DataFrames via temp parquet files
- Automatic cleanup of temporary result files

Signed-off-by: André Ahlert <[email protected]>
Comment thread plugins/duckdb/src/flyteplugins/duckdb/__init__.py Outdated
Comment thread plugins/duckdb/src/flyteplugins/duckdb/connector.py Outdated
Comment thread plugins/duckdb/src/flyteplugins/duckdb/connector.py Outdated
Comment thread plugins/duckdb/src/flyteplugins/duckdb/task.py Outdated
Contributor

@kumare3 kumare3 left a comment

I think you got it a little wrong

@andreahlert
Contributor Author

> I think you got it a little wrong

Thanks for the review! You're right, I should have looked at the existing flytekit DuckDB plugin as reference instead of modeling it after Snowflake. I'll rework this to use TaskTemplate with execute() and add DataFrame input type support.

@andreahlert andreahlert marked this pull request as draft February 8, 2026 23:11
Drop the AsyncConnector pattern (connector.py, dataframe.py) which is
designed for remote services. DuckDB runs locally and needs guaranteed
memory isolation, so the plugin now subclasses TaskTemplate directly
with an async execute() method, following the same pattern as
ContainerTask.

Changes:
- Accept pd.DataFrame, pa.Table and flyte.io.DataFrame as inputs
  registered as virtual tables via con.register()
- Support parameterized queries with ? and $N placeholders
- Support multi-query execution (list of queries)
- Support runtime queries via 'query' string input
- Remove connector dependency and entry-point from pyproject.toml
- 26 tests covering execution, DataFrame inputs, params, multi-query,
  runtime queries, extensions and serialization
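To make the multi-query and `?` placeholder behaviour listed above concrete, here is a rough sketch. It deliberately uses the standard-library `sqlite3` module as a stand-in so it runs without extra dependencies; DuckDB's Python connection offers a similar `execute()`/`fetchall()` surface, but the plugin's actual implementation is not shown in this thread. `run_queries` is a hypothetical helper, and returning the rows of the last query is an assumption. The `$N` placeholder style is not covered here.

```python
import sqlite3
from typing import Any, List, Optional, Sequence, Union


def run_queries(con: Any, queries: Union[str, Sequence[str]]) -> Optional[List[Any]]:
    """Hypothetical helper: run one query or a list of queries in order.

    Returns the rows of the last query, or None if the list is empty.
    """
    if isinstance(queries, str):
        queries = [queries]
    rows = None
    for sql in queries:
        rows = con.execute(sql).fetchall()
    return rows


# sqlite3 is used here only as a dependency-free stand-in for a DuckDB connection.
con = sqlite3.connect(":memory:")

rows = run_queries(
    con,
    [
        "CREATE TABLE t (x INTEGER)",
        "INSERT INTO t VALUES (1), (2), (3)",
        "SELECT SUM(x) FROM t",
    ],
)

# Positional "?" placeholders are bound from a sequence of values.
total = con.execute("SELECT ? + ?", (1, 2)).fetchone()
```

Keeping setup statements and the final `SELECT` in one ordered list means the caller only has to deal with the last result.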

Signed-off-by: Andre Ahlert <[email protected]>

Signed-off-by: André Ahlert <[email protected]>
Required for DataFrame input registration and Arrow table output.
Without these the CI test environment fails on import.

Signed-off-by: Andre Ahlert <[email protected]>

Signed-off-by: André Ahlert <[email protected]>
- Guard against empty query list (query=[]) which would cause
  AttributeError on None.to_arrow_table()
- Use startswith("insert") instead of "insert" in query to avoid
  false matches on column names like insert_date
- Add tests for both edge cases
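The difference between the two INSERT checks can be seen in a small sketch. The function names are hypothetical, and lowercasing plus stripping leading whitespace before the check is an assumption about how the query is normalized; the plugin's exact code is not shown here.

```python
def is_insert_naive(query: str) -> bool:
    # Substring check: also matches identifiers that merely contain "insert".
    return "insert" in query.lower()


def is_insert(query: str) -> bool:
    # Prefix check: only matches statements that begin with INSERT.
    return query.lstrip().lower().startswith("insert")


select_sql = "SELECT insert_date FROM events"
insert_sql = "INSERT INTO events VALUES (1)"

naive_on_select = is_insert_naive(select_sql)  # false positive from the column name
fixed_on_select = is_insert(select_sql)
fixed_on_insert = is_insert(insert_sql)
```

A prefix check still misses statements that start with a comment or a `WITH` clause, so it is a narrower heuristic rather than real SQL parsing.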

Signed-off-by: Andre Ahlert <[email protected]>

Signed-off-by: André Ahlert <[email protected]>
DuckDB's con.execute() always returns a result object even for DDL
statements, so the None guard was unreachable. Replace with a test
confirming DDL queries return an empty DataFrame.

Signed-off-by: Andre Ahlert <[email protected]>

Signed-off-by: André Ahlert <[email protected]>
@andreahlert andreahlert marked this pull request as ready for review March 31, 2026 03:12
@andreahlert andreahlert requested a review from kumare3 March 31, 2026 03:12
Comment thread plugins/duckdb/tests/test_task.py
…rames

Handle DataFrame inputs that have _raw_df populated (from wrap_df/from_df)
by converting directly to Arrow Table instead of trying to open via URI.

Signed-off-by: André Ahlert <[email protected]>
@andreahlert andreahlert requested a review from kumare3 March 31, 2026 03:32
@samhita-alla
Contributor

i'll review the PR later this week. thanks for porting it over!

@andreahlert andreahlert changed the title from "Add DuckDB plugin" to "feat: Add DuckDB plugin" Apr 11, 2026
3 participants